Code
remotes::install_github("jr-leary7/scLANE"){scLANE}Jack Leary
University of Florida
June 1, 2023
In this tutorial we’ll walk through a basic trajectory differential expression analysis. We’ll use the {scLANE} package, which we developed with the goal of providing accurate and biologically interpretable models of expression over the course of a biological process. At the end are a list of references we used in developing the method & writing the accompanying manuscript, as well as the poster I presented at ENAR 2023 in Nashville.
If you haven’t already, install the development version (currently v0.6.2) of {scLANE} from the GitHub repository.
Next, we’ll load the packages we need to process, analyze, & visualize our data.
library(dplyr) # data manipulation
library(scLANE) # trajectory DE
library(Seurat) # scRNA-seq tools
library(ggplot2) # plot utilities
library(patchwork) # plot combination
library(slingshot) # pseudotime estimation
library(reticulate) # Python interface
library(ComplexHeatmap) # heatmaps
rename <- dplyr::renameWe’ll also define a couple utilities to make our plots cleaner to read & easier to make.
theme_umap <- function(base.size = 14) {
ggplot2::theme_classic(base_size = base.size) +
ggplot2::theme(axis.ticks = ggplot2::element_blank(),
axis.text = ggplot2::element_blank(),
plot.subtitle = ggplot2::element_text(face = "italic", size = 11),
plot.caption = ggplot2::element_text(face = "italic", size = 11))
}
guide_umap <- function(key.size = 4) {
ggplot2::guides(color = ggplot2::guide_legend(override.aes = list(size = key.size, alpha = 1)))
}And consistent color palettes will make our plots easier to understand.
We’ll load the pancreatic endocrinogenesis data from Bastidas-Ponce et al (2019), which comes with the scVelo Python library & has been used in several pseudotime inference / RNA velocity method papers as a good benchmark dataset.
0%| | 0.00/50.0M [00:00<?, ?B/s]
1%|1 | 664k/50.0M [00:00<00:07, 6.80MB/s]
3%|2 | 1.38M/50.0M [00:00<00:07, 7.01MB/s]
5%|5 | 2.53M/50.0M [00:00<00:05, 9.30MB/s]
7%|6 | 3.36M/50.0M [00:00<00:05, 9.01MB/s]
9%|8 | 4.34M/50.0M [00:00<00:05, 9.45MB/s]
10%|# | 5.12M/50.0M [00:00<00:05, 9.02MB/s]
12%|#2 | 6.09M/50.0M [00:00<00:04, 9.33MB/s]
14%|#4 | 7.17M/50.0M [00:00<00:04, 9.93MB/s]
16%|#5 | 8.00M/50.0M [00:00<00:04, 9.39MB/s]
18%|#7 | 8.96M/50.0M [00:01<00:04, 9.59MB/s]
20%|#9 | 9.97M/50.0M [00:01<00:04, 9.83MB/s]
22%|##1 | 10.8M/50.0M [00:01<00:04, 9.41MB/s]
23%|##3 | 11.7M/50.0M [00:01<00:04, 9.41MB/s]
25%|##5 | 12.8M/50.0M [00:01<00:03, 9.97MB/s]
27%|##7 | 13.5M/50.0M [00:01<00:04, 9.40MB/s]
29%|##8 | 14.3M/50.0M [00:01<00:04, 9.03MB/s]
30%|### | 15.1M/50.0M [00:01<00:04, 8.75MB/s]
32%|###2 | 16.1M/50.0M [00:01<00:03, 9.22MB/s]
34%|###4 | 17.0M/50.0M [00:01<00:03, 9.38MB/s]
36%|###5 | 18.0M/50.0M [00:02<00:03, 9.64MB/s]
38%|###8 | 19.1M/50.0M [00:02<00:03, 9.99MB/s]
40%|###9 | 19.8M/50.0M [00:02<00:03, 9.44MB/s]
41%|####1 | 20.7M/50.0M [00:02<00:03, 9.17MB/s]
43%|####3 | 21.6M/50.0M [00:02<00:03, 9.32MB/s]
45%|####5 | 22.5M/50.0M [00:02<00:03, 9.50MB/s]
47%|####6 | 23.4M/50.0M [00:02<00:02, 9.40MB/s]
49%|####8 | 24.3M/50.0M [00:02<00:02, 9.24MB/s]
51%|##### | 25.3M/50.0M [00:02<00:02, 9.70MB/s]
52%|#####2 | 26.1M/50.0M [00:02<00:02, 9.17MB/s]
54%|#####3 | 27.0M/50.0M [00:03<00:02, 9.35MB/s]
56%|#####5 | 27.9M/50.0M [00:03<00:02, 9.27MB/s]
58%|#####7 | 28.8M/50.0M [00:03<00:02, 9.44MB/s]
59%|#####9 | 29.6M/50.0M [00:03<00:02, 9.09MB/s]
61%|######1 | 30.5M/50.0M [00:03<00:02, 9.20MB/s]
63%|######3 | 31.6M/50.0M [00:03<00:01, 9.73MB/s]
65%|######5 | 32.7M/50.0M [00:03<00:01, 10.2MB/s]
67%|######6 | 33.5M/50.0M [00:03<00:01, 9.67MB/s]
69%|######8 | 34.4M/50.0M [00:03<00:01, 9.69MB/s]
71%|####### | 35.4M/50.0M [00:03<00:01, 9.99MB/s]
73%|#######2 | 36.5M/50.0M [00:04<00:01, 10.2MB/s]
75%|#######4 | 37.4M/50.0M [00:04<00:01, 10.1MB/s]
77%|#######6 | 38.4M/50.0M [00:04<00:01, 10.0MB/s]
78%|#######8 | 39.3M/50.0M [00:04<00:01, 9.86MB/s]
80%|#######9 | 40.0M/50.0M [00:04<00:01, 9.07MB/s]
82%|########1 | 40.9M/50.0M [00:04<00:01, 9.26MB/s]
84%|########3 | 41.8M/50.0M [00:04<00:00, 9.30MB/s]
85%|########5 | 42.8M/50.0M [00:04<00:00, 9.43MB/s]
87%|########7 | 43.7M/50.0M [00:04<00:00, 9.41MB/s]
89%|########9 | 44.7M/50.0M [00:04<00:00, 9.59MB/s]
91%|#########1| 45.6M/50.0M [00:05<00:00, 9.51MB/s]
93%|#########2| 46.4M/50.0M [00:05<00:00, 9.31MB/s]
95%|#########4| 47.4M/50.0M [00:05<00:00, 9.53MB/s]
97%|#########6| 48.3M/50.0M [00:05<00:00, 9.59MB/s]
98%|#########8| 49.1M/50.0M [00:05<00:00, 9.13MB/s]
100%|##########| 50.0M/50.0M [00:05<00:00, 9.44MB/s]
The AnnData object contains data that we’ll need to extract, specifically the counts matrices (stored in AnnData.layers) and the cell-level metadata (which is in AnnData.obs).
AnnData object with n_obs × n_vars = 3696 × 27998
obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score'
var: 'highly_variable_genes'
uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'distances', 'connectivities'
The {reticulate} package allows us to pass the counts matrices & metadata from Python back to R. We’ll use the spliced mRNA counts as our default assay, and also define a new assay containing the total (spliced + unspliced) mRNA in each cell. Lastly, we remove genes with non-zero spliced mRNA in 3 or fewer cells. Note: while downloading this dataset requires a Python installation as well as the installation of the scVelo Python library (and its dependencies), running {scLANE} is done purely in R & requires no Python whatsoever.
spliced_counts <- Matrix::Matrix(t(as.matrix(py$adata$layers["spliced"])), sparse = TRUE)
unspliced_counts <- Matrix::Matrix(t(as.matrix(py$adata$layers["unspliced"])), sparse = TRUE)
rna_counts <- spliced_counts + unspliced_counts
colnames(rna_counts) <- colnames(spliced_counts) <- colnames(unspliced_counts) <- py$adata$obs_names$to_list()
rownames(rna_counts) <- rownames(spliced_counts) <- rownames(unspliced_counts) <- py$adata$var_names$to_list()
spliced_assay <- CreateAssayObject(counts = spliced_counts)
spliced_assay@key <- "spliced_"
unspliced_assay <- CreateAssayObject(counts = unspliced_counts)
unspliced_assay@key <- "unspliced_"
rna_assay <- CreateAssayObject(counts = rna_counts)
rna_assay@key <- "rna_"
meta_data <- py$adata$obs %>%
mutate(cell_name = rownames(.), .before = 1) %>%
rename(celltype = clusters,
celltype_coarse = clusters_coarse) %>%
mutate(nCount_spliced = colSums(spliced_counts),
nFeature_spliced = colSums(spliced_counts > 0),
nCount_unspliced = colSums(unspliced_counts),
nFeature_unspliced = colSums(unspliced_counts > 0),
nCount_rna = colSums(rna_counts),
nFeature_rna = colSums(rna_counts > 0))
seu <- CreateSeuratObject(counts = spliced_assay,
assay = "spliced",
project = "Mm_Panc_Endo",
meta.data = meta_data)
seu@assays$unspliced <- unspliced_assay
seu@assays$rna <- rna_assay
seu <- seu[rowSums(seu@assays$spliced) > 3, ]We preprocess the counts using a typical pipeline with QC, normalization, linear & nonlinear dimension reduction, and graph-based clustering via the Leiden algorithm.
seu <- PercentageFeatureSet(seu,
pattern = "^mt-",
col.name = "percent_mito",
assay = "spliced") %>%
PercentageFeatureSet(pattern = "^Rp[sl]",
col.name = "percent_ribo",
assay = "spliced") %>%
NormalizeData(assay = "spliced", verbose = FALSE) %>%
NormalizeData(assay = "unspliced", verbose = FALSE) %>%
NormalizeData(assay = "rna", verbose = FALSE) %>%
FindVariableFeatures(assay = "spliced",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(assay = "spliced",
vars.to.regress = c("percent_mito", "percent_ribo"),
model.use = "poisson",
verbose = FALSE) %>%
RunPCA(assay = "spliced",
npcs = 30,
approx = TRUE,
seed.use = 312,
verbose = FALSE) %>%
RunUMAP(reduction = "pca",
dims = 1:30,
n.components = 2,
metric = "cosine",
seed.use = 312,
verbose = FALSE) %>%
FindNeighbors(reduction = "pca",
k.param = 30,
nn.method = "annoy",
annoy.metric = "cosine",
verbose = FALSE) %>%
FindClusters(algorithm = 4,
resolution = 0.5,
random.seed = 312,
verbose = FALSE)Let’s visualize the results on our UMAP embedding. The clustering generally agrees with the celltype labels, though there is some overclustering in the ductal cells & underclustering in the mature endocrine celltypes.
p0 <- DimPlot(seu,
group.by = "seurat_clusters",
pt.size = 1,
cols = alpha(palette_cluster, 0.75)) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Leiden Cluster") +
theme_umap() +
theme(plot.title = element_blank()) +
guide_umap()
p1 <- DimPlot(seu,
group.by = "celltype",
pt.size = 1,
cols = alpha(palette_celltype, 0.75)) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Celltype") +
theme_umap() +
theme(plot.title = element_blank()) +
guide_umap()
p2 <- (p0 / p1) +
plot_annotation(title = "Murine Pancreatic Endocrinogenesis",
theme = theme_classic(base_size = 14))
p2We’ll start by fitting a trajectory using the {slingshot} R package. We define cluster 4 as the starting cluster, since in this case we’re already aware of the dataset’s underlying biology. After generating the estimates for each cell, we rescale the ordering to be defined on \([0, 1]\). This has no effect on the trajectory DE results however, and is mostly an aesthetic choice.
sling_res <- slingshot(Embeddings(seu, "umap"),
clusterLabels = seu$seurat_clusters,
start.clus = "4",
approx_points = 1000)
sling_pt <- slingPseudotime(sling_res) %>%
as.data.frame() %>%
magrittr::set_colnames(c("PT")) %>%
mutate(PT = (PT - min(PT)) / (max(PT) - min(PT)))
seu <- AddMetaData(seu,
metadata = sling_pt,
col.name = "sling_pt")Let’s visualize the results on our UMAP embedding. They match what we would expect (knowing the biological background of the data), with ductal cells at the start of the process and endocrine celltypes such as alpha, beta, & delta cells at the end of it.
p3 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(PT = sling_pt$PT) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = PT)) +
geom_point(size = 1, alpha = 0.75) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Pseudotime") +
scale_color_gradientn(colors = palette_heatmap,
labels = scales::label_number(accuracy = 0.1)) +
theme_umap()
p4 <- (p3 / p1) +
plot_annotation(title = "Estimated Cell Ordering from Slingshot",
theme = theme_classic(base_size = 14))
p4Next, we prepare the primary inputs to {scLANE}: a dense counts matrix (with cells as rows and genes as columns - this is important), a dataframe containing our estimated pseudotime ordering, and a character vector of the genes that we’re interested in modeling. We parallelize over genes in order to speed up the computation at the expense of using a little more memory. The models are fit using NB GLMs with optimal spline knots identified empirically, and differential expression is quantified using a likelihood ratio test of the fitted model vs. the null (intercept-only) model. In practice, genes designated as HVGs are usually the best candidates for modeling, so we choose the top 3,000 HVGs as our input. Note: the testing of the HVG set on its own is also justified by the reality that almost all trajectories are inferred using some sort of dimension-reduced space, and those embeddings are nearly universally generated using a set of HVGs. As such, genes not included in the HVG set actually have no direct relationship with the estimated trajectory, & it’s generally safe to exclude them from trajectory analyses.
top3k_hvg <- HVFInfo(seu) %>%
arrange(desc(variance.standardized)) %>%
slice_head(n = 3000) %>%
rownames(.)
raw_counts <- t(as.matrix(seu@assays$spliced@counts[top3k_hvg, ]))
scLANE_res <- testDynamic(expr.mat = raw_counts,
pt = sling_pt,
n.potential.basis.fns = 4,
parallel.exec = TRUE,
n.cores = 5,
track.time = TRUE)[1] "testDynamic evaluated 3000 genes with 1 lineage apiece in 12.507 mins"
We pull a sample of 10 genes from the results (which we clean up using getResultsDE()) & display their test statistics. By default, any gene with an adjusted p-value less than 0.01 is predicted to be dynamic, though this threshold can be easily adjusted.
scLANE_res_tidy <- getResultsDE(test.dyn.results = scLANE_res)
set.seed(629)
select(scLANE_res_tidy,
Gene,
Test_Stat,
P_Val,
P_Val_Adj,
Gene_Dynamic_Overall) %>%
mutate(Gene_Dynamic_Overall = if_else(Gene_Dynamic_Overall == 1, "Dynamic", "Static")) %>%
slice_sample(n = 10) %>%
kableExtra::kbl(digits = 5,
booktabs = TRUE,
col.names = c("Gene", "LRT Statistic", "P-value", "Adj. P-value", "Predicted Gene Status")) %>%
kableExtra::kable_classic(full_width = FALSE, "hover")| Gene | LRT Statistic | P-value | Adj. P-value | Predicted Gene Status |
|---|---|---|---|---|
| Cdt1 | 683.50332 | 0.00000 | 0.00000 | Dynamic |
| Plet1 | 51.35960 | 0.00000 | 0.00000 | Dynamic |
| Rnf130 | 687.54249 | 0.00000 | 0.00000 | Dynamic |
| Cib2 | 325.64929 | 0.00000 | 0.00000 | Dynamic |
| Gpx1 | 1754.86938 | 0.00000 | 0.00000 | Dynamic |
| Gm15440 | 7.48582 | 0.00622 | 1.00000 | Static |
| Suclg2 | 797.59131 | 0.00000 | 0.00000 | Dynamic |
| Sox5 | 239.85120 | 0.00000 | 0.00000 | Dynamic |
| Ctsk | 14.86037 | 0.00059 | 0.23782 | Static |
| Mpp2 | 130.12658 | 0.00000 | 0.00000 | Dynamic |
Next, we can use the plotModels() function to visualize the fitted models from {scLANE} and compare them to other modeling methods. The gene Neurog3 is strongly associated with epithelial cell differentiation, and indeed we see a very clear, nonlinear transcriptional dynamic across pseudotime for that gene. A traditional GLM fails to capture that nonlinearity, and a GAM over-smooths the trend and does not accurately model the sharpness of the transcriptional switch that occurs halfway through the trajectory. Only the scLANE model accurately models the rapid upregulation and equally swift downregulation of Neurog3 over pseudotime thanks to its adaptive choice of knots & piecewise linear nature.
We can check out the actual regression output for our gene of interest as well. The estimated knot is placed at 0.4386.
scLANE_res$Neurog3$Lineage_A$MARGE_Summary %>%
mutate(term = gsub("B_final", "", term),
term = if_else(term == "Intercept", term, paste0("h", term))) %>%
kableExtra::kbl(digits = 3,
booktabs = TRUE,
caption = "scLANE Model Output for <i>Neurog3<\\i>",
col.names = c("Hinge Function", "Coefficient", "Std. Error", "T-statistic", "P-value")) %>%
kableExtra::kable_classic(full_width = FALSE, "hover")| Hinge Function | Coefficient | Std. Error | T-statistic | P-value |
|---|---|---|---|---|
| Intercept | 3.359 | 0.073 | 46.046 | 0 |
| h(Lineage_A-0.4386) | -8.473 | 0.221 | -38.422 | 0 |
| h(0.4386-Lineage_A) | -8.272 | 0.291 | -28.418 | 0 |
Using the getFittedValues() function allows us to generate predictions from the models we fit, which we then use to visualize the dynamics of a few genes that are known to be strongly associated with the differentiation of immature cells into mature endocrine phenotypes. For all four genes, the fitted models show knots chosen in the area of pseudotime around the pre-endocrine cells. This tells us that these driver genes are being upregulated in precursor celltypes & are driving differentiation into the mature celltypes such as alpha & beta cells, after which the genes are downregulated.
p6 <- getFittedValues(test.dyn.res = scLANE_res,
genes = c("Chga", "Chgb", "Fev", "Cck"),
pt = sling_pt,
expr.mat = raw_counts,
cell.meta.data = select(seu@meta.data, celltype, celltype_coarse)) %>%
ggplot(aes(x = pt, y = expression)) +
facet_wrap(~gene,
ncol = 2,
scales = "free_y") +
geom_point(aes(color = celltype), size = 1, alpha = 0.75) +
geom_vline(data = data.frame(gene = "Chga", knot = unique(scLANE_res$Chga$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Chgb", knot = unique(scLANE_res$Chgb$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Cck", knot = unique(scLANE_res$Cck$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Fev", knot = unique(scLANE_res$Fev$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_ribbon(aes(ymin = scLANE_ci_ll, ymax = scLANE_ci_ul),
size = 0,
fill = "grey",
alpha = 1) +
geom_line(aes(y = scLANE_fitted),
color = "black",
size = 0.75) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.1)) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime",
y = "Expression",
title = "Endrocrinogenesis Driver Genes Across Pseudotime",
subtitle = "scLANE piecewise negative binomial GLMs") +
theme_classic(base_size = 14) +
theme(legend.title = element_blank(),
strip.text.x = element_text(face = "bold"),
plot.subtitle = element_text(face = "italic", size = 11)) +
guide_umap()
p6On the other hand, if we use additive models the “peak” of expression is placed among the mature endocrine celltypes - which doesn’t make biological sense if we know that these genes are driving that process of differentiation. This can of course be tweaked by changing the degree or degrees of freedom of the underlying basis spline, but choosing a “best” value for those hyperparameters can be difficult, whereas scLANE identifies optimal parameters internally by default.
p7 <- getFittedValues(test.dyn.res = scLANE_res,
genes = c("Chga", "Chgb", "Fev", "Cck"),
pt = sling_pt,
expr.mat = raw_counts,
cell.meta.data = select(seu@meta.data, celltype, celltype_coarse)) %>%
with_groups(gene,
mutate,
GAM_fitted_link = predict(gamlss::gamlss(expression ~ splines::bs(pt, degree = 3),
family = "NBI",
control = gamlss::gamlss.control(trace = FALSE))),
GAM_se_link = predict(gamlss::gamlss(expression ~ splines::bs(pt, degree = 3),
family = "NBI",
control = gamlss::gamlss.control(trace = FALSE)),
se.fit = TRUE)[[2]]) %>%
mutate(GAM_fitted = exp(GAM_fitted_link),
GAM_ci_ll = exp(GAM_fitted_link - qnorm(0.975, lower.tail = FALSE) * GAM_se_link),
GAM_ci_ul = exp(GAM_fitted_link + qnorm(0.975, lower.tail = FALSE) * GAM_se_link)) %>%
ggplot(aes(x = pt, y = expression)) +
facet_wrap(~gene,
ncol = 2,
scales = "free_y") +
geom_point(aes(color = celltype), size = 1, alpha = 0.75) +
geom_ribbon(aes(ymin = GAM_ci_ll, ymax = GAM_ci_ul),
size = 0,
fill = "grey",
alpha = 1) +
geom_line(aes(y = GAM_fitted),
color = "black",
size = 0.75) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.1)) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime",
y = "Expression",
title = "Endrocrinogenesis Driver Genes Across Pseudotime",
subtitle = "Cubic basis spline negative binomial GAMs") +
theme_classic(base_size = 14) +
theme(legend.title = element_blank(),
strip.text.x = element_text(face = "bold"),
plot.subtitle = element_text(face = "italic", size = 11)) +
guide_umap()
p7Let’s take a broader view of the dataset by examining the distribution of adaptively chosen knots from our models. We limit the analysis to the set of genes determined to be dynamic.
We’ll plot a histogram of the knot values along with a ridgeplot of the pseudotime distribution for each celltype. We see that the majority of the selected knots are placed at the beginning of the trajectory, around where the ductal cells transition into endocrine progenitors. A smaller set of knots is placed about halfway through the trajectory, which we’ve annotated as the point at which pre-endocrine cells begin differentiating into mature endocrine phenotypes.
p8 <- ggplot(knot_df, aes(x = knot)) +
geom_density(fill = "deepskyblue3",
alpha = 0.75,
color = "deepskyblue4",
size = 1) +
scale_x_continuous(limits = c(0, 1), labels = scales::label_number(accuracy = 0.1)) +
labs(x = "Knot Location") +
theme_classic(base_size = 14) +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
p9 <- data.frame(celltype = seu$celltype,
pt = seu$sling_pt) %>%
ggplot(aes(x = pt, y = celltype, fill = celltype, color = celltype)) +
ggridges::geom_density_ridges(alpha = 0.75, size = 1) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.1)) +
scale_fill_manual(values = palette_celltype) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime") +
theme_classic(base_size = 14) +
theme(axis.title.y = element_blank(),
legend.title = element_blank()) +
guide_umap()
p10 <- (p8 / p9) +
plot_layout(heights = c(1, 1.75)) +
plot_annotation(title = "Distribution of Adaptively-chosen Knots from scLANE",
theme = theme_classic(base_size = 14))
p10Picking joint bandwidth of 0.0184
We can extract a matrix of fitted values using smoothedCountsMatrix(); here we focus on the top 1,000 most dynamic genes, with the goal of identifying clusters of similarly-expressed genes. After reducing dimensionality with PCA, we cluster the genes using the Leiden algorithm & embed the genes in two dimensions with UMAP.
smoothed_counts <- smoothedCountsMatrix(scLANE_res,
genes = dyn_genes[1:2000],
parallel.exec = TRUE,
n.cores = 2)
set.seed(312)
smoothed_counts_pca <- irlba::prcomp_irlba(t(smoothed_counts$Lineage_A),
n = 30,
center = TRUE,
scale. = TRUE)
smoothed_counts_umap <- uwot::umap(smoothed_counts_pca$x,
n_components = 2,
metric = "cosine",
n_neighbors = 20,
init = "spectral")
smoothed_counts_snn <- bluster::makeSNNGraph(smoothed_counts_pca$x,
k = 20,
type = "jaccard",
BNPARAM = BiocNeighbors::AnnoyParam(distance = "Cosine"))
smoothed_counts_clust <- igraph::cluster_leiden(smoothed_counts_snn,
objective_function = "modularity",
resolution_parameter = 0.3)
gene_clust_df <- data.frame(gene = colnames(smoothed_counts$Lineage_A),
umap1 = smoothed_counts_umap[, 1],
umap2 = smoothed_counts_umap[, 2],
leiden = as.factor(smoothed_counts_clust$membership - 1L))The embedding & clustering show that even with the relatively small number of genes, clear patterns are visible.
p11 <- ggplot(gene_clust_df, aes(x = umap1, y = umap2, color = leiden)) +
geom_point(size = 1, alpha = 0.75) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Leiden Cluster",
title = "Unsupervised Clustering of Dynamic Genes",
subtitle = "Top 2,000 genes after PCA") +
paletteer::scale_color_paletteer_d("ggsci::default_igv") +
theme_umap() +
guide_umap()
p11We can also plot a heatmap of the dynamic genes; this requires a bit of setup, for which we’ll use the {ComplexHeatmap} package. We scale each gene, and clip values to be on \([-6, 6]\). The columns of the heatmap are ordered by estimated pseudotime.
col_anno_df <- select(seu@meta.data,
cell_name,
celltype,
sling_pt) %>%
mutate(celltype = as.factor(celltype)) %>%
arrange(sling_pt)
heatmap_mat <- t(scale(smoothed_counts$Lineage_A))
heatmap_mat[heatmap_mat > 6] <- 6
heatmap_mat[heatmap_mat < -6] <- -6
colnames(heatmap_mat) <- seu$cell_name
heatmap_mat <- heatmap_mat[, col_anno_df$cell_name]
palette_celltype_hm <- as.character(palette_celltype[1:length(unique(seu$celltype))])
names(palette_celltype_hm) <- levels(col_anno_df$celltype)
col_anno <- HeatmapAnnotation(Celltype = col_anno_df$celltype,
Pseudotime = col_anno_df$sling_pt,
col = list(Celltype = palette_celltype_hm,
Pseudotime = circlize::colorRamp2(seq(0, 1, by = 0.25), palette_heatmap)),
show_legend = TRUE,
show_annotation_name = FALSE,
gap = unit(1, "mm"),
border = TRUE)
palette_cluster_hm <- as.character(paletteer::paletteer_d("ggsci::default_igv")[1:length(unique(gene_clust_df$leiden))])
names(palette_cluster_hm) <- as.character(unique(gene_clust_df$leiden))
row_anno <- HeatmapAnnotation(Cluster = as.factor(gene_clust_df$leiden),
col = list(Cluster = palette_cluster_hm),
show_legend = TRUE,
show_annotation_name = FALSE,
annotation_legend_param = list(title = "Gene\nCluster"),
gap = unit(1, "mm"),
border = TRUE,
which = "row")The heatmap shows clear dynamic patterns across pseudotime, and the hierarchical clustering agrees fairly well with our graph-based clustering from earlier.
Heatmap(matrix = heatmap_mat,
name = "Spliced\nmRNA",
col = circlize::colorRamp2(colors = viridis::inferno(50),
breaks = seq(min(heatmap_mat), max(heatmap_mat), length.out = 50)),
cluster_columns = FALSE,
show_column_dend = FALSE,
cluster_column_slices = FALSE,
width = 12,
height = 6,
column_title = "Dynamic Genes Across Pseudotime in Murine Pancreatic Endocrinogenesis",
column_title_gp = gpar(fontface = "bold"),
show_row_dend = TRUE,
top_annotation = col_anno,
left_annotation = row_anno,
show_column_names = FALSE,
show_row_names = FALSE,
use_raster = TRUE,
raster_by_magick = TRUE,
raster_quality = 5)Loading required namespace: magick
Using our gene clusters & the {gprofiler2} package, we run an enrichment analysis against the biological process (BP) set of gene ontologies.
gene_clust_list <- purrr::map(unique(gene_clust_df$leiden), \(x) filter(gene_clust_df, leiden == x) %>% pull(gene))
names(gene_clust_list) <- paste0("Leiden_", unique(gene_clust_df$leiden))
enrich_res <- gprofiler2::gost(gene_clust_list,
organism = "mmusculus",
ordered_query = FALSE,
multi_query = FALSE,
sources = "GO:BP",
significant = TRUE)A look at the top 3 most-significant GO terms for each gene cluster reveals heterogeneous functionalities across groups of genes:
mutate(enrich_res$result,
query = gsub("Leiden_", "", query)) %>%
rename(cluster = query) %>%
with_groups(cluster,
slice_head,
n = 3) %>%
select(cluster, term_name, p_value, term_size, query_size, intersection_size, term_id) %>%
kableExtra::kbl(digits = 5,
booktabs = TRUE,
caption = "<i>Top 3 Biological Process GO Terms per Cluster<\\i>",
col.names = c("Gene Cluster", "Term Name", "Adj. P-value", "Term Size",
"Query Size", "Intersection Size", "Term ID")) %>%
kableExtra::kable_classic(c("hover"), full_width = FALSE)| Gene Cluster | Term Name | Adj. P-value | Term Size | Query Size | Intersection Size | Term ID |
|---|---|---|---|---|---|---|
| 0 | nervous system development | 0.00000 | 2555 | 271 | 80 | GO:0007399 |
| 0 | regulation of secretion by cell | 0.00000 | 673 | 271 | 41 | GO:1903530 |
| 0 | amide transport | 0.00000 | 423 | 271 | 33 | GO:0042886 |
| 1 | regulation of apoptotic process | 0.00000 | 1633 | 252 | 58 | GO:0042981 |
| 1 | regulation of programmed cell death | 0.00000 | 1665 | 252 | 58 | GO:0043067 |
| 1 | regulation of cell death | 0.00000 | 1836 | 252 | 61 | GO:0010941 |
| 2 | cell-cell signaling | 0.00000 | 1748 | 308 | 64 | GO:0007267 |
| 2 | regulation of biological quality | 0.00000 | 3244 | 308 | 91 | GO:0065008 |
| 2 | secretion | 0.00000 | 1084 | 308 | 48 | GO:0046903 |
| 3 | cell cycle | 0.00000 | 1817 | 285 | 174 | GO:0007049 |
| 3 | cell cycle process | 0.00000 | 1242 | 285 | 152 | GO:0022402 |
| 3 | mitotic cell cycle | 0.00000 | 881 | 285 | 121 | GO:0000278 |
| 4 | animal organ development | 0.00000 | 3326 | 307 | 106 | GO:0048513 |
| 4 | system development | 0.00000 | 4115 | 307 | 119 | GO:0048731 |
| 4 | developmental process | 0.00000 | 6865 | 307 | 158 | GO:0032502 |
| 5 | multicellular organism development | 0.00002 | 4873 | 145 | 55 | GO:0007275 |
| 5 | system development | 0.00003 | 4115 | 145 | 49 | GO:0048731 |
| 5 | anatomical structure development | 0.00014 | 6228 | 145 | 62 | GO:0048856 |
| 6 | anatomical structure development | 0.00000 | 6228 | 238 | 111 | GO:0048856 |
| 6 | multicellular organism development | 0.00000 | 4873 | 238 | 95 | GO:0007275 |
| 6 | tube development | 0.00000 | 1173 | 238 | 43 | GO:0035295 |
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os macOS Big Sur ... 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2023-06-01
pandoc 2.19.2 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
bigassertr 0.1.5 2021-07-08 [1] CRAN (R 4.2.0)
bigparallelr 0.3.2 2021-10-02 [1] CRAN (R 4.2.0)
bigstatsr 1.5.6 2022-02-03 [1] CRAN (R 4.2.0)
Biobase * 2.56.0 2022-04-26 [1] Bioconductor
BiocGenerics * 0.42.0 2022-04-26 [1] Bioconductor
BiocNeighbors 1.14.0 2022-04-26 [1] Bioconductor
BiocParallel 1.30.3 2022-06-05 [1] Bioconductor
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.0)
bluster 1.6.0 2022-04-26 [1] Bioconductor
boot 1.3-28 2021-05-03 [1] CRAN (R 4.2.1)
broom 1.0.0 2022-07-01 [1] CRAN (R 4.2.0)
broom.mixed 0.2.9.4 2022-04-17 [1] CRAN (R 4.2.0)
Cairo 1.6-0 2022-07-05 [1] CRAN (R 4.2.0)
circlize 0.4.15 2022-05-10 [1] CRAN (R 4.2.0)
cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0)
clue 0.3-61 2022-05-30 [1] CRAN (R 4.2.0)
cluster 2.1.4 2022-08-22 [1] CRAN (R 4.2.0)
coda 0.19-4 2020-09-30 [1] CRAN (R 4.2.0)
codetools 0.2-18 2020-11-04 [1] CRAN (R 4.2.1)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
ComplexHeatmap * 2.12.1 2022-08-09 [1] Bioconductor
cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.2.0)
crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
data.table 1.14.2 2021-09-27 [1] CRAN (R 4.2.0)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
DelayedArray 0.22.0 2022-04-26 [1] Bioconductor
DelayedMatrixStats 1.18.0 2022-04-26 [1] Bioconductor
deldir 1.0-6 2021-10-23 [1] CRAN (R 4.2.0)
digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
doParallel 1.0.17 2022-02-07 [1] CRAN (R 4.2.0)
dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
emmeans 1.8.0 2022-08-05 [1] CRAN (R 4.2.0)
estimability 1.4.1 2022-08-05 [1] CRAN (R 4.2.0)
evaluate 0.16 2022-08-09 [1] CRAN (R 4.2.0)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
fitdistrplus 1.1-8 2022-03-10 [1] CRAN (R 4.2.0)
flock 0.7 2016-11-12 [1] CRAN (R 4.2.0)
forcats 0.5.2 2022-08-19 [1] CRAN (R 4.2.0)
foreach 1.5.2 2022-02-02 [1] CRAN (R 4.2.0)
furrr 0.3.1 2022-08-15 [1] CRAN (R 4.2.0)
future 1.27.0 2022-07-22 [1] CRAN (R 4.2.0)
future.apply 1.9.0 2022-04-25 [1] CRAN (R 4.2.0)
gamlss 5.4-3 2022-04-24 [1] CRAN (R 4.2.0)
gamlss.data 6.0-2 2021-11-07 [1] CRAN (R 4.2.0)
gamlss.dist 6.0-5 2022-08-28 [1] CRAN (R 4.2.1)
geeM 0.10.1 2018-06-18 [1] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
GenomeInfoDb * 1.32.3 2022-08-09 [1] Bioconductor
GenomeInfoDbData 1.2.8 2022-08-29 [1] Bioconductor
GenomicRanges * 1.48.0 2022-04-26 [1] Bioconductor
GetoptLong 1.0.5 2020-12-15 [1] CRAN (R 4.2.0)
ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0)
ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.2.0)
ggridges 0.5.3 2021-01-08 [1] CRAN (R 4.2.0)
glm2 * 1.2.1 2018-08-11 [1] CRAN (R 4.2.0)
glmmTMB 1.1.5 2022-11-16 [1] CRAN (R 4.2.0)
GlobalOptions 0.1.2 2020-06-10 [1] CRAN (R 4.2.0)
globals 0.16.1 2022-08-28 [1] CRAN (R 4.2.1)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
goftest 1.2-3 2021-10-07 [1] CRAN (R 4.2.0)
gprofiler2 0.2.1 2021-08-23 [1] CRAN (R 4.2.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0)
here 1.0.1 2020-12-13 [1] CRAN (R 4.2.0)
highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.0)
httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.2.0)
httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
ica 1.0-3 2022-07-08 [1] CRAN (R 4.2.0)
igraph 1.3.4 2022-07-19 [1] CRAN (R 4.2.0)
IRanges * 2.30.1 2022-08-18 [1] Bioconductor
irlba 2.3.5 2021-12-06 [1] CRAN (R 4.2.0)
iterators 1.0.14 2022-02-05 [1] CRAN (R 4.2.0)
jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.0)
kableExtra 1.3.4 2021-02-20 [1] CRAN (R 4.2.0)
KernSmooth 2.23-20 2021-05-03 [1] CRAN (R 4.2.1)
knitr 1.40 2022-08-24 [1] CRAN (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
lattice 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.2.0)
leiden 0.4.2 2022-05-09 [1] CRAN (R 4.2.0)
lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0)
listenv 0.8.0 2019-12-05 [1] CRAN (R 4.2.0)
lme4 1.1-30 2022-07-08 [1] CRAN (R 4.2.0)
lmtest 0.9-40 2022-03-21 [1] CRAN (R 4.2.0)
magick 2.7.3 2021-08-18 [1] CRAN (R 4.2.0)
magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
MASS 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.0)
Matrix 1.4-1 2022-03-23 [1] CRAN (R 4.2.1)
MatrixGenerics * 1.8.1 2022-06-30 [1] Bioconductor
matrixStats * 0.62.0 2022-04-19 [1] CRAN (R 4.2.0)
mgcv 1.8-40 2022-03-29 [1] CRAN (R 4.2.1)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.0)
minqa 1.2.4 2014-10-09 [1] CRAN (R 4.2.0)
multcomp 1.4-20 2022-08-07 [1] CRAN (R 4.2.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
mvtnorm 1.1-3 2021-10-08 [1] CRAN (R 4.2.0)
nlme 3.1-159 2022-08-09 [1] CRAN (R 4.2.0)
nloptr 2.0.3 2022-05-26 [1] CRAN (R 4.2.0)
numDeriv 2016.8-1.1 2019-06-06 [1] CRAN (R 4.2.0)
paletteer 1.5.0 2022-10-19 [1] CRAN (R 4.2.0)
parallelly 1.32.1 2022-07-21 [1] CRAN (R 4.2.0)
patchwork * 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
pbapply 1.5-0 2021-09-16 [1] CRAN (R 4.2.0)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
plotly 4.10.0 2021-10-09 [1] CRAN (R 4.2.0)
plyr 1.8.7 2022-03-24 [1] CRAN (R 4.2.0)
png 0.1-7 2013-12-03 [1] CRAN (R 4.2.0)
polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.2.0)
princurve * 2.1.6 2021-01-18 [1] CRAN (R 4.2.0)
prismatic 1.1.1 2022-08-15 [1] CRAN (R 4.2.0)
progressr 0.10.1 2022-06-03 [1] CRAN (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
ps 1.7.1 2022-06-18 [1] CRAN (R 4.2.0)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.2.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
RANN 2.6.1 2019-01-08 [1] CRAN (R 4.2.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0)
RcppAnnoy 0.0.19 2021-07-30 [1] CRAN (R 4.2.0)
RcppEigen 0.3.3.9.2 2022-04-08 [1] CRAN (R 4.2.0)
RCurl 1.98-1.8 2022-07-30 [1] CRAN (R 4.2.0)
rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.2.0)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.2.0)
reticulate * 1.28 2023-01-27 [1] CRAN (R 4.2.0)
rgeos 0.5-9 2021-12-15 [1] CRAN (R 4.2.0)
rjson 0.2.21 2022-01-09 [1] CRAN (R 4.2.0)
rlang 1.0.4 2022-07-12 [1] CRAN (R 4.2.0)
rmarkdown 2.16 2022-08-24 [1] CRAN (R 4.2.0)
ROCR 1.0-11 2020-05-02 [1] CRAN (R 4.2.0)
rpart 4.1.16 2022-01-24 [1] CRAN (R 4.2.1)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
Rtsne 0.16 2022-04-17 [1] CRAN (R 4.2.0)
rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0)
S4Vectors * 0.34.0 2022-04-26 [1] Bioconductor
sandwich 3.0-2 2022-06-15 [1] CRAN (R 4.2.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
scattermore 0.8 2022-02-14 [1] CRAN (R 4.2.0)
scLANE * 0.6.2 2023-06-01 [1] local
sctransform 0.3.4 2022-08-20 [1] CRAN (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
Seurat * 4.1.1 2022-05-02 [1] CRAN (R 4.2.0)
SeuratObject * 4.1.0 2022-05-01 [1] CRAN (R 4.2.0)
shape 1.4.6 2021-05-19 [1] CRAN (R 4.2.0)
shiny 1.7.2 2022-07-19 [1] CRAN (R 4.2.0)
SingleCellExperiment * 1.18.0 2022-04-26 [1] Bioconductor
slingshot * 2.4.0 2022-04-26 [1] Bioconductor
sp * 1.5-0 2022-06-05 [1] CRAN (R 4.2.0)
sparseMatrixStats 1.8.0 2022-04-26 [1] Bioconductor
spatstat.core 2.4-4 2022-05-18 [1] CRAN (R 4.2.0)
spatstat.data 3.0-0 2022-10-21 [1] CRAN (R 4.2.0)
spatstat.geom 3.0-3 2022-10-25 [1] CRAN (R 4.2.0)
spatstat.random 3.0-1 2022-11-03 [1] CRAN (R 4.2.0)
spatstat.sparse 3.0-0 2022-10-21 [1] CRAN (R 4.2.0)
spatstat.utils 3.0-1 2022-10-19 [1] CRAN (R 4.2.0)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
stringr 1.4.1 2022-08-20 [1] CRAN (R 4.2.0)
SummarizedExperiment * 1.26.1 2022-05-01 [1] Bioconductor
survival 3.4-0 2022-08-09 [1] CRAN (R 4.2.0)
svglite 2.1.0 2022-02-03 [1] CRAN (R 4.2.0)
systemfonts 1.0.4 2022-02-11 [1] CRAN (R 4.2.0)
tensor 1.5 2012-05-05 [1] CRAN (R 4.2.0)
TH.data 1.1-1 2022-04-26 [1] CRAN (R 4.2.0)
tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
tidyr 1.2.0 2022-02-01 [1] CRAN (R 4.2.0)
tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0)
TMB 1.9.1 2022-08-16 [1] CRAN (R 4.2.0)
TrajectoryUtils * 1.4.0 2022-04-26 [1] Bioconductor
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
uwot 0.1.14 2022-08-22 [1] CRAN (R 4.2.0)
vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0)
viridis 0.6.2 2021-10-13 [1] CRAN (R 4.2.0)
viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0)
webshot 0.5.3 2022-04-14 [1] CRAN (R 4.2.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
xfun 0.32 2022-08-10 [1] CRAN (R 4.2.0)
xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
XVector 0.36.0 2022-04-26 [1] Bioconductor
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
zlibbioc 1.42.0 2022-04-26 [1] Bioconductor
zoo 1.8-10 2022-04-15 [1] CRAN (R 4.2.0)
[1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
─ Python configuration ───────────────────────────────────────────────────────
python: /Users/jack/Desktop/Python/science/venv/bin/python
libpython: /usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/config-3.8-darwin/libpython3.8.dylib
pythonhome: /Users/jack/Desktop/Python/science/venv:/Users/jack/Desktop/Python/science/venv
virtualenv: /Users/jack/Desktop/Python/science/venv/bin/activate_this.py
version: 3.8.16 (default, Dec 7 2022, 01:36:11) [Clang 14.0.0 (clang-1400.0.29.202)]
numpy: /Users/jack/Desktop/Python/science/venv/lib/python3.8/site-packages/numpy
numpy_version: 1.23.5
NOTE: Python version was forced by use_python function
──────────────────────────────────────────────────────────────────────────────
---
title: "Interpretable scRNA-seq Trajectory DE with `{scLANE}`"
author:
name: Jack Leary
email: j.leary@ufl.edu
affiliations:
- name: University of Florida
department: Biostatistics
city: Gainesville
state: FL
date: "`r Sys.Date()`"
format:
html:
code-fold: show
code-copy: true
code-tools: true
toc: true
embed-resources: true
fig-format: retina
fig-width: 9
fig-height: 6
df-print: kable
link-external-newwindow: true
execute:
cache: false
freeze: auto
---
```{r setup, include=FALSE}
reticulate::use_virtualenv("~/Desktop/Python/science/venv/", required = TRUE)
```
# Introduction
In this tutorial we'll walk through a basic trajectory differential expression analysis. We'll use the `{scLANE}` package, which we developed with the goal of providing accurate and biologically interpretable models of expression over the course of a biological process. At the end are a list of references we used in developing the method & writing the accompanying manuscript, as well as the poster I presented at [ENAR 2023](https://www.enar.org/meetings/spring2023/) in Nashville.
# Libraries
If you haven't already, install the development version (currently v`r packageVersion("scLANE")`) of `{scLANE}` from [the GitHub repository](https://github.com/jr-leary7/scLANE).
```{r, eval=FALSE}
remotes::install_github("jr-leary7/scLANE")
```
Next, we'll load the packages we need to process, analyze, & visualize our data.
```{r, message=FALSE, warning=FALSE, results='hide'}
library(dplyr) # data manipulation
library(scLANE) # trajectory DE
library(Seurat) # scRNA-seq tools
library(ggplot2) # plot utilities
library(patchwork) # plot combination
library(slingshot) # pseudotime estimation
library(reticulate) # Python interface
library(ComplexHeatmap) # heatmaps
rename <- dplyr::rename
```
# Helper Functions
We'll also define a couple utilities to make our plots cleaner to read & easier to make.
```{r}
theme_umap <- function(base.size = 14) {
ggplot2::theme_classic(base_size = base.size) +
ggplot2::theme(axis.ticks = ggplot2::element_blank(),
axis.text = ggplot2::element_blank(),
plot.subtitle = ggplot2::element_text(face = "italic", size = 11),
plot.caption = ggplot2::element_text(face = "italic", size = 11))
}
guide_umap <- function(key.size = 4) {
ggplot2::guides(color = ggplot2::guide_legend(override.aes = list(size = key.size, alpha = 1)))
}
```
And consistent color palettes will make our plots easier to understand.
```{r}
palette_cluster <- paletteer::paletteer_d("ggsci::default_jama")
palette_celltype <- paletteer::paletteer_d("ggsci::category20_d3")
palette_heatmap <- paletteer::paletteer_d("wesanderson::Zissou1")
```
# Data
We'll load the pancreatic endocrinogenesis data from [Bastidas-Ponce *et al* (2019)](https://doi.org/10.1242/dev.173849), which comes with the `scVelo` Python library & has been used in several pseudotime inference / RNA velocity method papers as a good benchmark dataset.
```{python, resuls='hide'}
import scvelo as scv
adata = scv.datasets.pancreas()
```
The `AnnData` object contains data that we'll need to extract, specifically the counts matrices (stored in `AnnData.layers`) and the cell-level metadata (which is in `AnnData.obs`).
```{python}
adata
```
The `{reticulate}` package allows us to pass the counts matrices & metadata from Python back to R. We'll use the spliced mRNA counts as our default assay, and also define a new assay containing the total (spliced + unspliced) mRNA in each cell. Lastly, we remove genes with non-zero spliced mRNA in 3 or fewer cells. **Note**: while downloading this dataset requires a Python installation as well as the installation of the `scVelo` Python library (and its dependencies), running `{scLANE}` is done purely in R & requires no Python whatsoever.
```{r}
spliced_counts <- Matrix::Matrix(t(as.matrix(py$adata$layers["spliced"])), sparse = TRUE)
unspliced_counts <- Matrix::Matrix(t(as.matrix(py$adata$layers["unspliced"])), sparse = TRUE)
rna_counts <- spliced_counts + unspliced_counts
colnames(rna_counts) <- colnames(spliced_counts) <- colnames(unspliced_counts) <- py$adata$obs_names$to_list()
rownames(rna_counts) <- rownames(spliced_counts) <- rownames(unspliced_counts) <- py$adata$var_names$to_list()
spliced_assay <- CreateAssayObject(counts = spliced_counts)
spliced_assay@key <- "spliced_"
unspliced_assay <- CreateAssayObject(counts = unspliced_counts)
unspliced_assay@key <- "unspliced_"
rna_assay <- CreateAssayObject(counts = rna_counts)
rna_assay@key <- "rna_"
meta_data <- py$adata$obs %>%
mutate(cell_name = rownames(.), .before = 1) %>%
rename(celltype = clusters,
celltype_coarse = clusters_coarse) %>%
mutate(nCount_spliced = colSums(spliced_counts),
nFeature_spliced = colSums(spliced_counts > 0),
nCount_unspliced = colSums(unspliced_counts),
nFeature_unspliced = colSums(unspliced_counts > 0),
nCount_rna = colSums(rna_counts),
nFeature_rna = colSums(rna_counts > 0))
seu <- CreateSeuratObject(counts = spliced_assay,
assay = "spliced",
project = "Mm_Panc_Endo",
meta.data = meta_data)
seu@assays$unspliced <- unspliced_assay
seu@assays$rna <- rna_assay
seu <- seu[rowSums(seu@assays$spliced) > 3, ]
```
```{r, echo=FALSE, message=FALSE, warning=FALSE, results='hide'}
system("rm -rf ./data") # remove data/ directory created by downloading pancreas .h5ad file
rm(meta_data, rna_assay, rna_counts, spliced_assay, spliced_counts, unspliced_assay, unspliced_counts)
```
We preprocess the counts using a typical pipeline with QC, normalization, linear & nonlinear dimension reduction, and graph-based clustering via the Leiden algorithm.
```{r, message=FALSE, warning=FALSE}
seu <- PercentageFeatureSet(seu,
pattern = "^mt-",
col.name = "percent_mito",
assay = "spliced") %>%
PercentageFeatureSet(pattern = "^Rp[sl]",
col.name = "percent_ribo",
assay = "spliced") %>%
NormalizeData(assay = "spliced", verbose = FALSE) %>%
NormalizeData(assay = "unspliced", verbose = FALSE) %>%
NormalizeData(assay = "rna", verbose = FALSE) %>%
FindVariableFeatures(assay = "spliced",
nfeatures = 3000,
verbose = FALSE) %>%
ScaleData(assay = "spliced",
vars.to.regress = c("percent_mito", "percent_ribo"),
model.use = "poisson",
verbose = FALSE) %>%
RunPCA(assay = "spliced",
npcs = 30,
approx = TRUE,
seed.use = 312,
verbose = FALSE) %>%
RunUMAP(reduction = "pca",
dims = 1:30,
n.components = 2,
metric = "cosine",
seed.use = 312,
verbose = FALSE) %>%
FindNeighbors(reduction = "pca",
k.param = 30,
nn.method = "annoy",
annoy.metric = "cosine",
verbose = FALSE) %>%
FindClusters(algorithm = 4,
resolution = 0.5,
random.seed = 312,
verbose = FALSE)
```
Let's visualize the results on our UMAP embedding. The clustering generally agrees with the celltype labels, though there is some overclustering in the ductal cells & underclustering in the mature endocrine celltypes.
```{r}
p0 <- DimPlot(seu,
group.by = "seurat_clusters",
pt.size = 1,
cols = alpha(palette_cluster, 0.75)) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Leiden Cluster") +
theme_umap() +
theme(plot.title = element_blank()) +
guide_umap()
p1 <- DimPlot(seu,
group.by = "celltype",
pt.size = 1,
cols = alpha(palette_celltype, 0.75)) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Celltype") +
theme_umap() +
theme(plot.title = element_blank()) +
guide_umap()
p2 <- (p0 / p1) +
plot_annotation(title = "Murine Pancreatic Endocrinogenesis",
theme = theme_classic(base_size = 14))
p2
```
# Trajectory Inference
## Pseudotime Estimation
We'll start by fitting a trajectory using the `{slingshot}` R package. We define cluster 4 as the starting cluster, since in this case we're already aware of the dataset's underlying biology. After generating the estimates for each cell, we rescale the ordering to be defined on $[0, 1]$. This has no effect on the trajectory DE results however, and is mostly an aesthetic choice.
```{r}
sling_res <- slingshot(Embeddings(seu, "umap"),
clusterLabels = seu$seurat_clusters,
start.clus = "4",
approx_points = 1000)
sling_pt <- slingPseudotime(sling_res) %>%
as.data.frame() %>%
magrittr::set_colnames(c("PT")) %>%
mutate(PT = (PT - min(PT)) / (max(PT) - min(PT)))
seu <- AddMetaData(seu,
metadata = sling_pt,
col.name = "sling_pt")
```
Let's visualize the results on our UMAP embedding. They match what we would expect (knowing the biological background of the data), with ductal cells at the start of the process and endocrine celltypes such as alpha, beta, & delta cells at the end of it.
```{r}
p3 <- Embeddings(seu, "umap") %>%
as.data.frame() %>%
magrittr::set_colnames(c("UMAP_1", "UMAP_2")) %>%
mutate(PT = sling_pt$PT) %>%
ggplot(aes(x = UMAP_1, y = UMAP_2, color = PT)) +
geom_point(size = 1, alpha = 0.75) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Pseudotime") +
scale_color_gradientn(colors = palette_heatmap,
labels = scales::label_number(accuracy = 0.1)) +
theme_umap()
p4 <- (p3 / p1) +
plot_annotation(title = "Estimated Cell Ordering from Slingshot",
theme = theme_classic(base_size = 14))
p4
```
## Trajectory Differential Expression
Next, we prepare the primary inputs to `{scLANE}`: a dense counts matrix (with cells as rows and genes as columns - this is important), a dataframe containing our estimated pseudotime ordering, and a character vector of the genes that we're interested in modeling. We parallelize over genes in order to speed up the computation at the expense of using a little more memory. The models are fit using NB GLMs with optimal spline knots identified empirically, and differential expression is quantified using a likelihood ratio test of the fitted model vs. the null (intercept-only) model. In practice, genes designated as HVGs are usually the best candidates for modeling, so we choose the top 3,000 HVGs as our input. **Note**: the testing of the HVG set on its own is also justified by the reality that almost all trajectories are inferred using some sort of dimension-reduced space, and those embeddings are nearly universally generated using a set of HVGs. As such, genes not included in the HVG set actually have no direct relationship with the estimated trajectory, & it's generally safe to exclude them from trajectory analyses.
```{r, results='hold'}
top3k_hvg <- HVFInfo(seu) %>%
arrange(desc(variance.standardized)) %>%
slice_head(n = 3000) %>%
rownames(.)
raw_counts <- t(as.matrix(seu@assays$spliced@counts[top3k_hvg, ]))
scLANE_res <- testDynamic(expr.mat = raw_counts,
pt = sling_pt,
n.potential.basis.fns = 4,
parallel.exec = TRUE,
n.cores = 5,
track.time = TRUE)
```
We pull a sample of 10 genes from the results (which we clean up using `getResultsDE()`) & display their test statistics. By default, any gene with an adjusted *p*-value less than 0.01 is predicted to be dynamic, though this threshold can be easily adjusted.
```{r}
scLANE_res_tidy <- getResultsDE(test.dyn.results = scLANE_res)
set.seed(629)
select(scLANE_res_tidy,
Gene,
Test_Stat,
P_Val,
P_Val_Adj,
Gene_Dynamic_Overall) %>%
mutate(Gene_Dynamic_Overall = if_else(Gene_Dynamic_Overall == 1, "Dynamic", "Static")) %>%
slice_sample(n = 10) %>%
kableExtra::kbl(digits = 5,
booktabs = TRUE,
col.names = c("Gene", "LRT Statistic", "P-value", "Adj. P-value", "Predicted Gene Status")) %>%
kableExtra::kable_classic(full_width = FALSE, "hover")
```
Next, we can use the `plotModels()` function to visualize the fitted models from `{scLANE}` and compare them to other modeling methods. The gene [*Neurog3*](https://www.ncbi.nlm.nih.gov/gene/11925) is strongly associated with epithelial cell differentiation, and indeed we see a very clear, nonlinear transcriptional dynamic across pseudotime for that gene. A traditional GLM fails to capture that nonlinearity, and a GAM over-smooths the trend and does not accurately model the sharpness of the transcriptional switch that occurs halfway through the trajectory. Only the `scLANE` model accurately models the rapid upregulation and equally swift downregulation of *Neurog3* over pseudotime thanks to its adaptive choice of knots & piecewise linear nature.
```{r, fig.width=9, fig.height=5}
p5 <- plotModels(scLANE_res,
gene = "Neurog3",
pt = sling_pt,
plot.null = FALSE,
gene.counts = raw_counts) +
scale_color_manual(values = c("forestgreen"))
p5
```
We can check out the actual regression output for our gene of interest as well. The estimated knot is placed at `r unique(scLANE_res$Neurog3$Lineage_A$MARGE_Slope_Data$Rounded_Breakpoint)`.
```{r}
scLANE_res$Neurog3$Lineage_A$MARGE_Summary %>%
mutate(term = gsub("B_final", "", term),
term = if_else(term == "Intercept", term, paste0("h", term))) %>%
kableExtra::kbl(digits = 3,
booktabs = TRUE,
caption = "scLANE Model Output for <i>Neurog3<\\i>",
col.names = c("Hinge Function", "Coefficient", "Std. Error", "T-statistic", "P-value")) %>%
kableExtra::kable_classic(full_width = FALSE, "hover")
```
Using the `getFittedValues()` function allows us to generate predictions from the models we fit, which we then use to visualize the dynamics of a few genes that are known to be strongly associated with the differentiation of immature cells into mature endocrine phenotypes. For all four genes, the fitted models show knots chosen in the area of pseudotime around the pre-endocrine cells. This tells us that these driver genes are being upregulated in precursor celltypes & are driving differentiation into the mature celltypes such as alpha & beta cells, after which the genes are downregulated.
```{r, fig.width=12, fig.height=6}
p6 <- getFittedValues(test.dyn.res = scLANE_res,
genes = c("Chga", "Chgb", "Fev", "Cck"),
pt = sling_pt,
expr.mat = raw_counts,
cell.meta.data = select(seu@meta.data, celltype, celltype_coarse)) %>%
ggplot(aes(x = pt, y = expression)) +
facet_wrap(~gene,
ncol = 2,
scales = "free_y") +
geom_point(aes(color = celltype), size = 1, alpha = 0.75) +
geom_vline(data = data.frame(gene = "Chga", knot = unique(scLANE_res$Chga$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Chgb", knot = unique(scLANE_res$Chgb$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Cck", knot = unique(scLANE_res$Cck$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_vline(data = data.frame(gene = "Fev", knot = unique(scLANE_res$Fev$Lineage_A$MARGE_Slope_Data$Breakpoint)),
mapping = aes(xintercept = knot),
linetype = "dashed",
color = "grey20") +
geom_ribbon(aes(ymin = scLANE_ci_ll, ymax = scLANE_ci_ul),
size = 0,
fill = "grey",
alpha = 1) +
geom_line(aes(y = scLANE_fitted),
color = "black",
size = 0.75) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.1)) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime",
y = "Expression",
title = "Endrocrinogenesis Driver Genes Across Pseudotime",
subtitle = "scLANE piecewise negative binomial GLMs") +
theme_classic(base_size = 14) +
theme(legend.title = element_blank(),
strip.text.x = element_text(face = "bold"),
plot.subtitle = element_text(face = "italic", size = 11)) +
guide_umap()
p6
```
On the other hand, if we use additive models the "peak" of expression is placed among the mature endocrine celltypes - which doesn't make biological sense if we know that these genes are driving that process of differentiation. This can of course be tweaked by changing the degree or degrees of freedom of the underlying basis spline, but choosing a "best" value for those hyperparameters can be difficult, whereas `scLANE` identifies optimal parameters internally by default.
```{r, fig.width=12, fig.height=6}
p7 <- getFittedValues(test.dyn.res = scLANE_res,
genes = c("Chga", "Chgb", "Fev", "Cck"),
pt = sling_pt,
expr.mat = raw_counts,
cell.meta.data = select(seu@meta.data, celltype, celltype_coarse)) %>%
with_groups(gene,
mutate,
GAM_fitted_link = predict(gamlss::gamlss(expression ~ splines::bs(pt, degree = 3),
family = "NBI",
control = gamlss::gamlss.control(trace = FALSE))),
GAM_se_link = predict(gamlss::gamlss(expression ~ splines::bs(pt, degree = 3),
family = "NBI",
control = gamlss::gamlss.control(trace = FALSE)),
se.fit = TRUE)[[2]]) %>%
mutate(GAM_fitted = exp(GAM_fitted_link),
GAM_ci_ll = exp(GAM_fitted_link - qnorm(0.975, lower.tail = FALSE) * GAM_se_link),
GAM_ci_ul = exp(GAM_fitted_link + qnorm(0.975, lower.tail = FALSE) * GAM_se_link)) %>%
ggplot(aes(x = pt, y = expression)) +
facet_wrap(~gene,
ncol = 2,
scales = "free_y") +
geom_point(aes(color = celltype), size = 1, alpha = 0.75) +
geom_ribbon(aes(ymin = GAM_ci_ll, ymax = GAM_ci_ul),
size = 0,
fill = "grey",
alpha = 1) +
geom_line(aes(y = GAM_fitted),
color = "black",
size = 0.75) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.1)) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime",
y = "Expression",
title = "Endrocrinogenesis Driver Genes Across Pseudotime",
subtitle = "Cubic basis spline negative binomial GAMs") +
theme_classic(base_size = 14) +
theme(legend.title = element_blank(),
strip.text.x = element_text(face = "bold"),
plot.subtitle = element_text(face = "italic", size = 11)) +
guide_umap()
p7
```
Let's take a broader view of the dataset by examining the distribution of adaptively chosen knots from our models. We limit the analysis to the set of genes determined to be dynamic.
```{r}
dyn_genes <- filter(scLANE_res_tidy, Gene_Dynamic_Overall == 1) %>%
pull(Gene)
knot_df <- purrr::imap(scLANE_res[dyn_genes],
\(x, y) {
data.frame(
gene = y,
knot = x$Lineage_A$MARGE_Slope_Data$Breakpoint
)
}) %>%
purrr::reduce(rbind)
```
We'll plot a histogram of the knot values along with a ridgeplot of the pseudotime distribution for each celltype. We see that the majority of the selected knots are placed at the beginning of the trajectory, around where the ductal cells transition into endocrine progenitors. A smaller set of knots is placed about halfway through the trajectory, which we've annotated as the point at which pre-endocrine cells begin differentiating into mature endocrine phenotypes.
```{r, fig.width=9, fig.height=6}
p8 <- ggplot(knot_df, aes(x = knot)) +
geom_density(fill = "deepskyblue3",
alpha = 0.75,
color = "deepskyblue4",
size = 1) +
scale_x_continuous(limits = c(0, 1), labels = scales::label_number(accuracy = 0.1)) +
labs(x = "Knot Location") +
theme_classic(base_size = 14) +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
p9 <- data.frame(celltype = seu$celltype,
pt = seu$sling_pt) %>%
ggplot(aes(x = pt, y = celltype, fill = celltype, color = celltype)) +
ggridges::geom_density_ridges(alpha = 0.75, size = 1) +
scale_x_continuous(labels = scales::label_number(accuracy = 0.1)) +
scale_fill_manual(values = palette_celltype) +
scale_color_manual(values = palette_celltype) +
labs(x = "Pseudotime") +
theme_classic(base_size = 14) +
theme(axis.title.y = element_blank(),
legend.title = element_blank()) +
guide_umap()
p10 <- (p8 / p9) +
plot_layout(heights = c(1, 1.75)) +
plot_annotation(title = "Distribution of Adaptively-chosen Knots from scLANE",
theme = theme_classic(base_size = 14))
p10
```
We can extract a matrix of fitted values using `smoothedCountsMatrix()`; here we focus on the top 1,000 most dynamic genes, with the goal of identifying clusters of similarly-expressed genes. After reducing dimensionality with PCA, we cluster the genes using the Leiden algorithm & embed the genes in two dimensions with UMAP.
```{r}
smoothed_counts <- smoothedCountsMatrix(scLANE_res,
genes = dyn_genes[1:2000],
parallel.exec = TRUE,
n.cores = 2)
set.seed(312)
smoothed_counts_pca <- irlba::prcomp_irlba(t(smoothed_counts$Lineage_A),
n = 30,
center = TRUE,
scale. = TRUE)
smoothed_counts_umap <- uwot::umap(smoothed_counts_pca$x,
n_components = 2,
metric = "cosine",
n_neighbors = 20,
init = "spectral")
smoothed_counts_snn <- bluster::makeSNNGraph(smoothed_counts_pca$x,
k = 20,
type = "jaccard",
BNPARAM = BiocNeighbors::AnnoyParam(distance = "Cosine"))
smoothed_counts_clust <- igraph::cluster_leiden(smoothed_counts_snn,
objective_function = "modularity",
resolution_parameter = 0.3)
gene_clust_df <- data.frame(gene = colnames(smoothed_counts$Lineage_A),
umap1 = smoothed_counts_umap[, 1],
umap2 = smoothed_counts_umap[, 2],
leiden = as.factor(smoothed_counts_clust$membership - 1L))
```
The embedding & clustering show that even with the relatively small number of genes, clear patterns are visible.
```{r}
p11 <- ggplot(gene_clust_df, aes(x = umap1, y = umap2, color = leiden)) +
geom_point(size = 1, alpha = 0.75) +
labs(x = "UMAP 1",
y = "UMAP 2",
color = "Leiden Cluster",
title = "Unsupervised Clustering of Dynamic Genes",
subtitle = "Top 2,000 genes after PCA") +
paletteer::scale_color_paletteer_d("ggsci::default_igv") +
theme_umap() +
guide_umap()
p11
```
We can also plot a heatmap of the dynamic genes; this requires a bit of setup, for which we'll use the `{ComplexHeatmap}` package. We scale each gene, and clip values to be on $[-6, 6]$. The columns of the heatmap are ordered by estimated pseudotime.
```{r}
col_anno_df <- select(seu@meta.data,
cell_name,
celltype,
sling_pt) %>%
mutate(celltype = as.factor(celltype)) %>%
arrange(sling_pt)
heatmap_mat <- t(scale(smoothed_counts$Lineage_A))
heatmap_mat[heatmap_mat > 6] <- 6
heatmap_mat[heatmap_mat < -6] <- -6
colnames(heatmap_mat) <- seu$cell_name
heatmap_mat <- heatmap_mat[, col_anno_df$cell_name]
palette_celltype_hm <- as.character(palette_celltype[1:length(unique(seu$celltype))])
names(palette_celltype_hm) <- levels(col_anno_df$celltype)
col_anno <- HeatmapAnnotation(Celltype = col_anno_df$celltype,
Pseudotime = col_anno_df$sling_pt,
col = list(Celltype = palette_celltype_hm,
Pseudotime = circlize::colorRamp2(seq(0, 1, by = 0.25), palette_heatmap)),
show_legend = TRUE,
show_annotation_name = FALSE,
gap = unit(1, "mm"),
border = TRUE)
palette_cluster_hm <- as.character(paletteer::paletteer_d("ggsci::default_igv")[1:length(unique(gene_clust_df$leiden))])
names(palette_cluster_hm) <- as.character(unique(gene_clust_df$leiden))
row_anno <- HeatmapAnnotation(Cluster = as.factor(gene_clust_df$leiden),
col = list(Cluster = palette_cluster_hm),
show_legend = TRUE,
show_annotation_name = FALSE,
annotation_legend_param = list(title = "Gene\nCluster"),
gap = unit(1, "mm"),
border = TRUE,
which = "row")
```
The heatmap shows clear dynamic patterns across pseudotime, and the hierarchical clustering agrees fairly well with our graph-based clustering from earlier.
```{r, fig.width=12, fig.height=6}
Heatmap(matrix = heatmap_mat,
name = "Spliced\nmRNA",
col = circlize::colorRamp2(colors = viridis::inferno(50),
breaks = seq(min(heatmap_mat), max(heatmap_mat), length.out = 50)),
cluster_columns = FALSE,
show_column_dend = FALSE,
cluster_column_slices = FALSE,
width = 12,
height = 6,
column_title = "Dynamic Genes Across Pseudotime in Murine Pancreatic Endocrinogenesis",
column_title_gp = gpar(fontface = "bold"),
show_row_dend = TRUE,
top_annotation = col_anno,
left_annotation = row_anno,
show_column_names = FALSE,
show_row_names = FALSE,
use_raster = TRUE,
raster_by_magick = TRUE,
raster_quality = 5)
```
Using our gene clusters & the `{gprofiler2}` package, we run an enrichment analysis against [the biological process (BP) set of gene ontologies](http://geneontology.org/docs/ontology-documentation/).
```{r}
gene_clust_list <- purrr::map(unique(gene_clust_df$leiden), \(x) filter(gene_clust_df, leiden == x) %>% pull(gene))
names(gene_clust_list) <- paste0("Leiden_", unique(gene_clust_df$leiden))
enrich_res <- gprofiler2::gost(gene_clust_list,
organism = "mmusculus",
ordered_query = FALSE,
multi_query = FALSE,
sources = "GO:BP",
significant = TRUE)
```
A look at the top 3 most-significant GO terms for each gene cluster reveals heterogeneous functionalities across groups of genes:
```{r}
mutate(enrich_res$result,
query = gsub("Leiden_", "", query)) %>%
rename(cluster = query) %>%
with_groups(cluster,
slice_head,
n = 3) %>%
select(cluster, term_name, p_value, term_size, query_size, intersection_size, term_id) %>%
kableExtra::kbl(digits = 5,
booktabs = TRUE,
caption = "<i>Top 3 Biological Process GO Terms per Cluster<\\i>",
col.names = c("Gene Cluster", "Term Name", "Adj. P-value", "Term Size",
"Query Size", "Intersection Size", "Term ID")) %>%
kableExtra::kable_classic(c("hover"), full_width = FALSE)
```
# Session Info
```{r}
sessioninfo::session_info()
```
```{r, echo=FALSE, include=FALSE, results='hide', eval=FALSE}
ggsave(plot = p0,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Leiden_Cluster_UMAP.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p1,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Celltype_UMAP.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p2,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Leiden_Cluster_and_Celltype_Combined_UMAP.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p3,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Slingshot_Pseudotime_UMAP.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p4,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Slingshot_Pseudotime_and_Celltype_Combined_UMAP.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p5,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Neurog3_Gene_Dynamics_Model_Comparison.pdf",
device = "pdf",
width = 12,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p6,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Endocrine_Driver_Gene_Dynamics_scLANE.pdf",
device = "pdf",
width = 12,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p7,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Endocrine_Driver_Gene_Dynamics_GAM.pdf",
device = "pdf",
width = 12,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p8,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Dynamic_Gene_Knot_Distribution.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p9,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Pseudotime_Density_by_Celltype_Ridgeline.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p10,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Dynamic_Gene_Knot_Distribution_and_Celltype_Pseudotime_Ridgeline_Combined.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
ggsave(plot = p11,
path = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/",
filename = "Leiden_Cluster_of_Top_Dynamic_Genes_UMAP.pdf",
device = "pdf",
width = 9,
height = 6,
dpi = "retina",
units = "in")
pdf("~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/scLANE_Pseudotime_Heatmap.pdf",
width = 12,
height = 6)
Heatmap(matrix = heatmap_mat,
name = "Spliced\nmRNA",
col = circlize::colorRamp2(colors = viridis::inferno(50),
breaks = seq(min(heatmap_mat), max(heatmap_mat), length.out = 50)),
cluster_columns = FALSE,
show_column_dend = FALSE,
cluster_column_slices = FALSE,
width = 12,
height = 6,
column_title = "Dynamic Genes Across Pseudotime in Murine Pancreatic Endocrinogenesis",
column_title_gp = gpar(fontface = "bold"),
show_row_dend = TRUE,
top_annotation = col_anno,
left_annotation = row_anno,
show_column_names = FALSE,
show_row_names = FALSE,
use_raster = TRUE,
raster_by_magick = TRUE,
raster_quality = 5)
dev.off()
openxlsx::write.xlsx(enrich_res$result,
file = "~/Desktop/PhD/Research/scLANE_Project/Writing/Panc_Endo_Case_Study/GOBP_Enrichment.xlsx",
overwrite = TRUE)
```